Linear Regression

Randy Johnson

6/21/2018

Overview

Least Squares Intuition

Sand flea data from McDonald, J.H. 1989. Selection component analysis of the Mpi locus in the amphipod Platorchestia platensis. Heredity 62: 243-249.

See the Handbook of Biological Statistics for additional examples.

Least Squares Intuition

Model Notation

eggs = β₀ + β₁ mg + ε

Model Output

eggs = β₀ + β₁ mg + ε

## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  12.6890     4.2009   3.021   0.0056 **
## mg            1.6017     0.6176   2.593   0.0154 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

eggs = 12.7 + 1.6*mg + ε
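A minimal sketch of fitting this model in R; the sand flea data set itself is not included here, so the `fleas` data frame below is simulated purely for illustration:

```r
# Simulated stand-in for the sand flea data: `mg` is female dry weight,
# `eggs` is clutch size. Hypothetical values, not McDonald's data.
set.seed(1)
fleas <- data.frame(mg = runif(28, 4, 9))
fleas$eggs <- 12.7 + 1.6 * fleas$mg + rnorm(28, sd = 2)

fit <- lm(eggs ~ mg, data = fleas)  # eggs = b0 + b1*mg + error
coef(fit)     # intercept and slope estimates
summary(fit)  # full coefficient table, as shown above
```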

Assumption 1: Error terms are normally distributed

Two ways to test this assumption: a Shapiro-Wilk test on the standardized residuals, and a quantile-quantile (Q-Q) plot.

## 
##  Shapiro-Wilk normality test
## 
## data:  .std.resid
## W = 0.93555, p-value = 0.08517

See my code and ?shapiro.test for more details.
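A minimal sketch of this check: base R's `rstandard()` produces standardized residuals equivalent to the `.std.resid` above (the built-in `cars` data stand in for the sand flea data here):

```r
fit <- lm(dist ~ speed, data = cars)  # stand-in model for illustration
std_resid <- rstandard(fit)           # standardized residuals
shapiro.test(std_resid)               # H0: residuals are normally distributed
```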

Assumption 1: Error terms are normally distributed

## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferonni p
## 16 3.425071          0.0021295     0.059625

See my code and ?outlierTest from the car package for more details.
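A minimal sketch of the call, assuming the car package is installed (again using the built-in `cars` data as a stand-in):

```r
library(car)  # provides outlierTest(); assumed to be installed

fit <- lm(dist ~ speed, data = cars)  # stand-in model for illustration
# Bonferroni-adjusted test of the largest |studentized residual|
outlierTest(fit)
```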

Influence

We see that our borderline-significant outlier (#16) has very little influence on the coefficients, but it does change the error terms.

* The trend line changes very little if we exclude this data point from the analysis.
* The point does shift the line slightly upward, which results in a slightly larger estimate for all data points.

The 24th data point is also highlighted here. It has a larger residual (though it is not an outlier) and sits at the edge of the distribution, which gives it increased leverage and influence.

See my code and ?influencePlot from the car package for more details.
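A minimal sketch of the influence diagnostic, assuming the car package is installed:

```r
library(car)  # provides influencePlot(); assumed to be installed

fit <- lm(dist ~ speed, data = cars)  # stand-in model for illustration
# Studentized residuals vs. hat values; circle area is proportional to
# Cook's distance. Returns the noteworthy points as a data frame.
influencePlot(fit)
```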

Assumption 1: Error terms are normally distributed

We see that our borderline outlier, in the top-right corner, is larger than we would expect, but not quite outside the confidence region within which our error terms should fall.

See my code and ?qqPlot from the car package for more details.
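A minimal sketch of the Q-Q diagnostic, assuming the car package is installed:

```r
library(car)  # provides qqPlot(); assumed to be installed

fit <- lm(dist ~ speed, data = cars)  # stand-in model for illustration
# Studentized residuals against t-distribution quantiles, with a
# pointwise confidence envelope like the one described above.
qqPlot(fit)
```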

Assumption 2: Linear Relationship

size = β₀ + β₁ length + ε

size = β₀ + β₁ length + β₂ length² + ε

Data from Ashton, K.G., R.L. Burke, and J.N. Layne. 2007. Geographic variation in body and clutch size of gopher tortoises. Copeia 2007: 355-363.

Assumption 2: Linear Relationship

size = β₀ + β₁ length + ε

## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.59974   19.53898   0.031    0.976
## length       0.02435    0.06316   0.386    0.706

size = β₀ + β₁ length + β₂ length² + ε

## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -9.634e+02  2.958e+02  -3.257  0.00625 **
## length       6.264e+00  1.913e+00   3.275  0.00603 **
## length2     -1.008e-02  3.088e-03  -3.263  0.00617 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
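The two fits above can be sketched as follows; the tortoise data are not included here, so `tortoise` is simulated, and the quadratic term is added with I() inside the formula:

```r
# Hypothetical stand-in for the gopher tortoise data: `length` is carapace
# length (mm), `size` is clutch size. Not Ashton et al.'s actual data.
set.seed(1)
tortoise <- data.frame(length = runif(30, 285, 330))
tortoise$size <- -963.4 + 6.264 * tortoise$length -
  0.01008 * tortoise$length^2 + rnorm(30)

fit_linear    <- lm(size ~ length, data = tortoise)
fit_quadratic <- lm(size ~ length + I(length^2), data = tortoise)
summary(fit_quadratic)  # length and length^2 terms, as in the table above
```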

Component Residual Plots

size = β0 + β1 length + ε

Component Residual Plots

size = β0 + β1 length + β2 length2 + ε

Multivariate Normality

mvnExample <- data_frame(x1 = rnorm(100),
                         y1 = x1 + rnorm(100), # normal error terms
                         y2 = x1 + rt(100, 3)) # not-normal error terms
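A sketch of fitting both models and testing residual normality; the data are rebuilt here with base `data.frame()` (as `mvnExample2`, a hypothetical name) so the block is self-contained, and results will vary with the random seed:

```r
set.seed(42)
x <- rnorm(100)
mvnExample2 <- data.frame(x1 = x,
                          y1 = x + rnorm(100),  # normal error terms
                          y2 = x + rt(100, 3))  # heavy-tailed error terms

m1 <- lm(y1 ~ x1, data = mvnExample2)
m2 <- lm(y2 ~ x1, data = mvnExample2)
shapiro.test(rstandard(m1))  # typically fails to reject normality
shapiro.test(rstandard(m2))  # typically rejects normality
```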

Multivariate Normality - Diagnostics (good)

## 
##  Shapiro-Wilk normality test
## 
## data:  .std.resid
## W = 0.98544, p-value = 0.3417

Multivariate Normality - Diagnostics (bad)

## 
##  Shapiro-Wilk normality test
## 
## data:  .std.resid
## W = 0.91781, p-value = 1.079e-05

Multivariate Normality - Model comparison

Model 1 has normal error terms.

## 
## Call:
## lm(formula = y1 ~ x1, data = mvnExample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7996 -0.6230  0.1467  0.7049  1.9721 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.15329    0.09750  -1.572    0.119    
## x1           0.91487    0.09161   9.987   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9619 on 98 degrees of freedom
## Multiple R-squared:  0.5044, Adjusted R-squared:  0.4993 
## F-statistic: 99.74 on 1 and 98 DF,  p-value: < 2.2e-16

Multivariate Normality - Model comparison

Model 2 has non-normal error terms.

## 
## Call:
## lm(formula = y2 ~ x1, data = mvnExample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4554 -1.1273 -0.1941  1.0091  8.7573 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1716     0.1932   0.888    0.377    
## x1            0.8026     0.1815   4.421 2.54e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.906 on 98 degrees of freedom
## Multiple R-squared:  0.1663, Adjusted R-squared:  0.1578 
## F-statistic: 19.55 on 1 and 98 DF,  p-value: 2.539e-05

No/Little Multicollinearity

mcol <- data_frame(x1 = rnorm(100),
                   x2 = x1 + rnorm(100), # x1 and x2 are correlated
                   x3 = rnorm(100),
                   y = x1 + x2 + x3 + rnorm(100))

# three models
model1 <- lm(y ~ x1 + x2 + x3, data = mcol) # everything
model2 <- lm(y ~ x1 +      x3, data = mcol) # without x2
model3 <- lm(y ~      x2 + x3, data = mcol) # without x1

No/Little Multicollinearity - Variance Inflation Factors

# check for multicollinearity
vif(model1)
##       x1       x2       x3 
## 2.271759 2.271338 1.005520
vif(model2)
##       x1       x3 
## 1.000791 1.000791
vif(model3)
##       x2       x3 
## 1.000606 1.000606

No/Little Multicollinearity - Model comparison

model      x1      x2      x3      R²
    1    0.94    1.02    0.98    0.87
    2    1.88      --    0.93    0.76
    3      --    1.59    1.03    0.81
Model coefficients for our three models. Notice that the coefficient for x3 barely changes, while the coefficients for x1 and x2, as well as the R², change considerably.

No Autocorrelation

Adding a loess smoother highlights this periodic pattern in the data.

No Autocorrelation

lm(y1 ~ x1, data = dat) %>%
    durbinWatsonTest()
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1940815      1.609687   0.002
##  Alternative hypothesis: rho != 0

Homoscedasticity

y1 ~ x1

## 
## Suggested power transformation:  1.030183
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.05980537    Df = 1     p = 0.8068038

Homoscedasticity

y2 ~ x1

## 
## Suggested power transformation:  0.1781482
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 42.36333    Df = 1     p = 7.579794e-11
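A sketch of these two diagnostics, assuming the car package is installed; `dat` here is simulated so that the spread of y2 grows with x1:

```r
library(car)  # provides ncvTest() and spreadLevelPlot(); assumed installed

set.seed(1)
dat <- data.frame(x1 = runif(200, 1, 10))
dat$y2 <- 3 + 0.5 * dat$x1 + rnorm(200, sd = dat$x1 / 3)  # heteroscedastic

fit <- lm(y2 ~ x1, data = dat)
ncvTest(fit)          # score test for non-constant error variance
spreadLevelPlot(fit)  # also prints a suggested power transformation
```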

Homoscedasticity

## 
## Call:
## lm(formula = y1 ~ x1, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.86192 -0.55272 -0.04338  0.54027  2.54020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.04454    0.20709   0.215  0.82991   
## x1           0.21653    0.06599   3.281  0.00122 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9762 on 198 degrees of freedom
## Multiple R-squared:  0.05158,    Adjusted R-squared:  0.04679 
## F-statistic: 10.77 on 1 and 198 DF,  p-value: 0.00122

Homoscedasticity

## 
## Call:
## lm(formula = y2 ~ x1, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2229 -1.1031 -0.2710  0.8684 10.1981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.3360     0.4025   0.835    0.405    
## x1            0.6483     0.1282   5.055 9.76e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.897 on 198 degrees of freedom
## Multiple R-squared:  0.1143, Adjusted R-squared:  0.1098 
## F-statistic: 25.56 on 1 and 198 DF,  p-value: 9.755e-07

Transformations

Many of the problems we encounter in linear regression stem from model choice.

The two most common transformations are power transformations and log transformations.

Power Transformations

Original model: y = β₀ + β₁ x + ε

With power transformation: y = β₀ + β₁ x² + ε

Power Transformations

Original model: y = β₀ + β₁ x + ε

With power transformation: y = β₀ + β₁ √x + ε

Log Transformations

Original model: y = β₀ + β₁ x + ε

With log transformation: y = β₀ + β₁ ln(x) + ε

Log Transformations

Original model: y = β₀ + β₁ x + ε

With log transformation: ln(y) = β₀ + β₁ x + ε
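A minimal sketch of the log-response transformation, with simulated data whose error is multiplicative on the original scale:

```r
set.seed(1)
x <- runif(100, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(100, sd = 0.2))  # multiplicative error in y

fit_log <- lm(log(y) ~ x)  # additive, roughly normal errors on the log scale
coef(fit_log)              # estimates of beta0 and beta1 on the log scale
```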

Assumptions Summary

Assumption                    Cause                          Consequence                            Diagnosis
Linear Relationship           Incorrect model                Inaccurate predictions                 car::crPlots()
Multivariate Normality        Incorrect model / noisy data   Incorrect statistics (p-values / CIs)  car::qqPlot(), shapiro.test(residuals)
No/Little Multicollinearity   Correlated variables           Unstable model coefficients            car::vif()
No Autocorrelation            Non-independent data           Inefficient estimators                 car::durbinWatsonTest()
Homoscedasticity              Incorrect model / "bad" data   Incorrect statistics (p-values / CIs)  car::ncvTest(), car::spreadLevelPlot()

Thanks

Statistics for Lunch Team

* Greg Alvord
* Eckart Bindewald
* Taina Immonen
* Brian Luke
* George Nelson
* Ravi Ravichandran
* Tom Schneider